\chapter{Background}

\section{Ethnic Background}
Thailand covers an area of 514,000~km$^2$ in the center of the Southeast Asian peninsula. It is bordered by Myanmar (Burma), the Lao People's Democratic Republic (Laos), Cambodia and Malaysia, and has 2,420~km of coastline on the Gulf of Thailand and the Andaman Sea. Thailand stretches 1,650~km from north to south and 780~km from east to west at its widest part (http://www.un.or.th/thailand/geography.html). The estimated population is 64 million, of which approximately 9.3 million live in Bangkok, the capital city, and its vicinity. The official language is Thai, which 94\% of the population use as their first language. The four major dialects of Thai are those spoken in the central, northern, southern and northeastern regions; the northeastern dialect is closely related to the Lao language. In the four southernmost provinces of Pattani, Satun, Yala and Naratiwat, situated near the Malaysian border, the majority of the population is Muslim and speaks ``Pattani'' Malay. In the mountainous area of the northern region, there are approximately 525,000 highland people, or hill tribes, who speak distinct languages. Ten to fifteen percent of the population is of Chinese origin, owing to a steady flow of immigration from China to Thailand from 1850 until the end of World War II. As a result, Chinese commercial and artisan communities were established throughout the country (http://www.un.or.th/thailand/population.html).

In terms of ethnic-specific Mendelian disease background, thalassemia is the most common genetic disease in Thailand. Mutation carriers are estimated at 30--40\% of the Thai population. This high carrier prevalence results in more than 12,000 severely affected births annually in Thailand (Lagampan et al., 2004). Among other diseases with a genetic component, cancer is also a major health problem and has been the most common cause of death since 1999 (Ministry of Public Health, 2004). In Thai men, liver cancer is the most common cancer, followed by lung cancer, whereas cervical and breast cancer are the two most prevalent cancers in Thai women (Sriplung et al., 2003).

\section{Related Works on Human Mutation}
Mutations in human gene pathology and evolution represent two sides of the same coin in that the same mechanisms that have frequently been implicated in disease-associated mutagenesis appear also to have been involved in potentiating evolutionary change. Indeed, the mutational spectra of germline mutations responsible for inherited disease, somatic mutations underlying tumorigenesis, polymorphisms (either neutral or functionally significant) and differences between orthologous gene sequences exhibit remarkable similarities, implying that they may have causal mechanisms in common. Since these different categories of mutation share multiple unifying characteristics, they should no longer be viewed as distinct entities but rather as portions of a continuum of genetic change that links population genetics and molecular medicine with molecular evolution.

The phenotype is the result of ontogenetic development. This also holds true at the molecular level, because molecular biological processes take place within the organism. In ontogenesis, genetic and nongenetic factors interact in producing successive states, each of which is the prerequisite, and determines the conditions, for the next one to follow. In this interplay, genes are a necessary, but not sufficient, component. The structures already present, gradients, threshold values, positional relationships, and conditions of the internal milieu are equally essential. Thus, even monofactorial traits can be considered to be of multifactorial causation, and the varying borderline conditions that arise during development add to the complexity. From this standpoint, a mutation cannot be expected to have a consistent phenotypic outcome, and the genotype-phenotype relationship may be irregular. Furthermore, selected examples of hereditary diseases illustrate genotypic versus phenotypic heterogeneity, together with the conditions and mechanisms that contribute to it. The genotype-phenotype relationship is therefore neither unidimensional, programmatical nor hierarchical in a strict sense. Nevertheless, in particular cases, ontogenetic modification appears to be of minor significance, so that the phenotype of a mutation can be predicted with considerable accuracy. This is no surprise if, depending on the nature of the mutation and the physiological function of the gene affected, the genotype-phenotype relationship is direct. However, this relationship may also be consistent in more complex conditions. It is assumed that the total of the non-genetic influences (epigenetic, environmental) is usually so similar, or is compensated by the organism to such an extent, that the respective mutation acts as the major variable during ontogenetic development.

The relationship between pathogenetic mutations and disease phenotype is becoming increasingly complex. Well-delineated clinical entities can be genetically heterogeneous, and mutations in a particular gene may result in fundamental clinical differences. Genetic heterogeneity includes mutations at different gene loci, or allelic mutations within a single gene, that result in a similar phenotype. By contrast, one and the same mutation would be expected to be associated with a uniform clinical picture. Evidence has been presented that this is not necessarily the case: identical mutations can result in highly variable combinations of clinical features. Although the number of examples of this puzzling phenomenon is rapidly increasing, the underlying mechanisms are as yet poorly understood. In some cases, interacting genetic alterations can be held responsible for the phenotypic heterogeneity; in others, epigenetic phenomena provide a plausible explanation. The Mendelian concept of monofactorial disease causation thus appears to be increasingly untenable for a growing number of developmental errors.

The human genome has somewhere around 30,000 genes(~ref). If we consider that some genes, such as the gene underlying cystic fibrosis, have nearly 1,000 mutations causing this rare inherited disorder, there may be up to $30 \times 10^{6}$ mutations causing single-gene disorders if mutations in all genes cause disease. A more conservative figure is $3 \times 10^{6}$. If we also consider non-disease-causing polymorphisms, which are thought to occur every 200--1,000 bases in the $3 \times 10^{9}$-base genome, we arrive at 3--15 million possible polymorphisms. These polymorphisms are important in common disease, in variation in drug metabolism, and as markers in linkage studies. When one considers single base changes in the $3 \times 10^{9}$ bases, each of which can change to one of three others, there are potentially $9 \times 10^{9}$ possible base changes (without insertions or deletions). Thus it is clear that there are likely to be at least tens of millions of base changes that are important to human health. In the case of single-gene disorders, each mutational event needs to be characterized by at least 10 extra pieces of data, ideally more like 50, whereas polymorphisms perhaps need fewer. This means that at least hundreds of millions of pieces of data are needed to fully record variation in the human genome. This is only one order of magnitude less than the task of recording the human genome sequence of $3 \times 10^{9}$ units. Thus it is in the interest of medical science that a system be put in place to systematically collect accurate variation data, store it safely, and make it available to those who need it.
Since these early developments, the number of databases has expanded. Databases that collect mutations in single genes are called locus-specific mutation databases (LSDBs), whereas those that collect mutations in all or many genes are referred to as central or general mutation databases.
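
The order-of-magnitude estimates above can be checked with a short calculation. The following is only a sketch: the symbols $G$ (number of genes), $m$ (mutations per gene), $L$ (genome length in bases), $d$ (average spacing between polymorphisms) and $k$ (pieces of data per recorded change) are introduced here for illustration and do not appear in the cited work.
\[
\begin{array}{rcl}
G \times m & = & (3 \times 10^{4}) \times 10^{3} \;=\; 3 \times 10^{7} \;=\; 30 \times 10^{6} \\[4pt]
L / d & = & \frac{3 \times 10^{9}}{1000} \ \mbox{to} \ \frac{3 \times 10^{9}}{200} \;=\; 3 \times 10^{6} \ \mbox{to} \ 1.5 \times 10^{7} \\[4pt]
3 \times L & = & 3 \times (3 \times 10^{9}) \;=\; 9 \times 10^{9} \\[4pt]
10^{7} \times k & \geq & 10^{7} \times 10 \;=\; 10^{8}
\end{array}
\]
The first line reproduces the upper bound of 30 million disease-causing mutations, the second the range of 3--15 million polymorphisms, the third the number of possible single base changes, and the last the lower bound on the number of data items, taking $10^{7}$ as the smallest value consistent with ``tens of millions'' of relevant changes and $k \geq 10$. The conservative figure of $3 \times 10^{6}$ mutations would correspond to an average of about 100 mutations per gene under the same assumed gene count.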

\section{Related Works on Human Mutation Databases and LSDBs}

Many mutation databases have been established, both during the Human Genome Project and since its completion. In 1973, scientists assembled at the first Human Gene Mapping Workshop to discuss the 64 human genes mapped at that time. In 1989, the GDB Human Genome Database was created to store information on 1,700 mapped human genes. Ten years later, as the Human Genome Project closed in on the release of the complete DNA sequence holding as many as 100,000 human genes, GDB was evolving to continue to meet the needs of the scientific community. Well known as a resource for data that had been stringently reviewed as part of the curation process, GDB aimed to continue providing a compilation of the human genome including maps, map objects, polymorphisms, and mutations. As more sites across the Internet were established to share biological information, it became increasingly burdensome for scientists to collect data from all sources in a particular domain. To reduce this burden, GDB continued to load data from large genome centres and to accept submissions from researchers around the world. Moreover, GDB sought to provide a mechanism for linking gene-related information to the human reference sequence, and planned to establish federated linkages with ``boutique'' databases around the world that could contain enormous amounts of valuable information about specific genes or chromosomes.

The Human Protein Reference Database (HPRD) (http://www.hprd.org) was developed to serve as a comprehensive collection of protein features, post-translational modifications (PTMs) and protein-protein interactions. By its 2006 update, the database had grown to more than 20,000 protein entries and had become the largest database of literature-derived protein-protein interactions (more than 30,000) and PTMs (more than 8,000) for human proteins. Several new features were introduced in HPRD at that time, including: (i) protein isoforms, (ii) enhanced search options, (iii) linking of pathway annotations and (iv) integration of a novel browser, GenProt Viewer (http://www.genprot.org), developed by the HPRD group to allow integration of genomic and proteomic information. With continued support and active participation by the biomedical community, HPRD is expected to become a unique source of curated information on the human proteome and to spur biomedical discoveries based on the integration of genomic, transcriptomic and proteomic data.

The HUGO Mutation Database Initiative (HUGO MDI), launched in 1994, has been supported by the Human Genome Organization (HUGO) and the March of Dimes and has around 600 members in 34 countries. The overarching objective of the HUGO MDI has been to combine the strengths of the central databases and the LSDBs. In broad terms, the Initiative set out to establish a federation of LSDB curators to ensure capture of mutations, and to work with central databases to ensure storage and distribution on a proper bioinformatics basis. Mutation nomenclature was an early concern: although several systems were in use, proper discussion with consequent recommendations had never occurred. This process resulted in a HUGO MDI recommended nomenclature for simple changes, with further discussion of more complex mutations. The most daunting problem is how to ensure complete collection of all the variation that is being uncovered. The problem is compounded by the fact that journals generally do not accept reports of single mutations after the initial wave that follows the discovery of a disease gene. This is especially so for, say, the 452nd mutation causing PKU, or even a group of such mutations. Initially the journal Human Mutation accepted such reports and published them electronically, but this has ceased. The Initiative members have thus been moved to plan an integrated system of receipt, review, publication, PubMed ID registration, and public storage. This has resulted in a pilot receiving point, the `WayStation', an agreement for publication of the data by Wiley-Liss in Human Mutation, and an agreement by HGBASE to be the storage database for the data. Another approach to ensuring mutation capture has been to encourage national databases, which are likely to be able to contact all diagnostic and research laboratories in their country, to promote the collection of mutations.
One such database is the Turkish database. Besides ensuring mutation collection, such national (or ethnic) databases are a vital aid to the delivery of national genetic health care. Because of past and current large-scale transnational migration, such national or ethnic databases are also of international importance.

Subsequently, a listing of LSDBs, intended as an aid to research and clinical work, was posted on the HUGO MDI website (now the HGVS website; http://www.hgvs.org) in early 1998. It contained 209 databases and was later published. The listing has grown over the years, making it increasingly difficult to maintain; a new database of LSDBs was therefore created as a relational database on the MySQL platform (http://www.mysql.com) to make curation of these sites easier. In January 2006, a program was initiated to update the listing and add unlisted LSDBs. Dead links were investigated, and curators were contacted to create new links. This process led to the permanent deletion of four LSDBs and the addition of 176 more LSDBs from various sources, 75 of which were from the Retina International Scientific Newsletter Mutation Databases and 72 from the IMT Bioinformatic Groups Mutation Databases. The latter two sets could perhaps be called aggregated databases. In the case of Retina International, the databases appear to be derived directly from the literature. The latest listing (9 March 2007) includes 672 LSDBs and is likely to grow (http://www.hgvs.org/dblist/glsdb.html). This number represents 32\% of the genes in which at least one mutation has been reported (2,056 genes as of 9 March 2007, according to the Human Gene Mutation Database, HGMD).

The Human Gene Mutation Database (HGMD) is now regarded as one of the most comprehensive core collections of data on germ-line mutations in nuclear genes underlying or associated with human inherited disease (www.hgmd.org). HGMD represents an attempt to collate known (published) gene lesions responsible for human inherited disease. It comprises various types of germ-line mutation within the coding, splicing and regulatory regions of human nuclear genes. As of 1 December 2007, it contained 76,011 different mutations in 2,876 human genes.

In 2009, a number of new features were added to HPRD. These include PhosphoMotif Finder, which allows users to find the presence of over 320 experimentally verified phosphorylation motifs in proteins of interest. Another new feature is a distributed protein annotation system, Human Proteinpedia (http://www.humanproteinpedia.org/), through which laboratories can submit their data, which are then mapped onto protein entries in HPRD. Over 75 laboratories involved in proteomics research have already participated in this effort by submitting data for over 15,000 human proteins. The submitted data include mass spectrometry and protein microarray-derived data, among other data types. Finally, HPRD is also linked to NetPath (http://www.netpath.org/), a compendium of human signaling pathways developed by the same group, which currently contains annotations for several cancer and immune signaling pathways. Since the previous update, more than 5,500 new protein sequences have been added, making HPRD a comprehensive resource for studying the human proteome.

Data catalogued in HGMD include: single base-pair substitutions in coding, regulatory and splicing-relevant regions; micro-deletions and micro-insertions; indels; triplet repeat expansions; gross deletions, insertions and duplications; and complex rearrangements. Each mutation is entered into HGMD only once, in order to avoid confusion between recurrent and identical-by-descent lesions. By March 2003, the database contained in excess of 39,415 different lesions detected in 1,516 different nuclear genes, with new entries accumulating at a rate exceeding 5,000 per annum. Since its inception, HGMD has been expanded to include cDNA reference sequences for more than 87\% of listed genes, splice junction sequences, disease-associated and functional polymorphisms, as well as links to data present in publicly available online locus-specific mutation databases. HGMD has also entered into a licensing agreement with Celera Genomics (Rockville, MD).

Although 20 years have elapsed since the first single base-pair substitution underlying an inherited disease in humans was characterised at the DNA level, the initiative has only recently been taken to establish central database resources for pathological genetic variants. Disease-associated gene lesions are currently collected and publicised by the Human Gene Mutation Database (HGMD) in Cardiff, by locus-specific mutation databases, and to some extent also by the Genome Database (GDB) and Online Mendelian Inheritance in Man (OMIM). To date, HGMD represents the only comprehensive and publicly available database of gene lesions underlying human inherited disease. By July 1999, HGMD contained over 18,000 different mutations from some 900 human genes, the majority being single base-pair substitutions. In addition to its potential as an information resource for clinicians and genetic counsellors, HGMD has allowed molecular geneticists to address a variety of biological questions through meta-analysis of the collated data. HGMD also promises to assist research workers in optimising mutation search strategies for a given gene. A questionnaire sent out to, and answered by, the editors of 20 key journals revealed that human genetics journals are increasingly reluctant to publish mutation reports. Electronic data submission and publication facilities are therefore urgently required. The World Wide Web (WWW) provides an excellent medium within which to combine the centralised management of basic mutation data, including rigorous quality control, with the possibility of publishing additional mutation-related information. In response to these needs, HGMD has both instituted a collaboration with Springer-Verlag GmbH, Heidelberg, to enable free online submission and electronic publication of human gene mutation data, and developed links with the curators of locus-specific mutation databases.

Regarding access, HGMD has introduced a user registration scheme, which is free for users from academic and non-profit organizations. Prior registration is required to access and use HGMD. After completing the registration form, users are sent a password by email, which they can use to log on to the public HGMD website. Since the inception of the system in April 2006, over 23,000 user registrations have been recorded, and HGMD continues to accrue about 800 new registrations every month. Registered users come from over 150 different countries (Table 3), which indicates how widely HGMD is used by the academic community worldwide. Each month, an average of 14,000 queries for genes are received (with an equal number accessing HGMD genes via an external link) from almost 6,000 users, with a total of over 160,000 pages served. The Advanced Search is essentially a suite of software tools, available as part of HGMD Professional, designed to enhance mutation searching, viewing and retrieval. Two of the main types of mutation in HGMD (single-nucleotide substitutions and microlesions) can be interrogated with this toolset. The datasets of each mutation type can be combined (for example, micro-deletions, micro-insertions and indels) to enable more powerful searching across comparable types of mutation. When using the Advanced Search, users can tailor their queries with more specific criteria, including amino-acid exchange; nucleotide substitution; the size of a micro-deletion, micro-insertion or indel; composition; motifs (both those created and those abolished by the mutation); dbSNP number; and keywords in the article title or abstract. Any mutation results returned by the Advanced Search can be downloaded in a tab-delimited format, ready to import into a different application. The Advanced Search also includes a dynamic mutation viewer, which depicts coding-region mutations superimposed on the cDNA sequence of a gene.
The wild-type cDNA sequence is shown in black, whereas the mutated nucleotides are shown in different colors according to the type of mutation. The display of each mutation type can be switched on or off using the appropriate buttons.

A current limitation of HGMD with regard to recording disease-associated polymorphic variants of functional significance within HGMD is the inclusion of only a single literature reference for each variant. A large proportion of those papers reporting a novel association between a disease and a polymorphic variant do not include functional data on that variant. HGMD will in the future address this by implementing a dual referencing system for polymorphisms: reference 1 will correspond to the first report demonstrating a functional effect (or disease-association) that meets the HGMD inclusion criteria, whereas reference 2 will (where appropriate) provide evidence of the first disease-association (or functional effect) of the polymorphism.

Several other databases have attempted to collate known polymorphism-disease associations but have met with only partial success, owing to an over-reliance on computerized search procedures and automated data collection. This methodology tends to produce a database that comprises either verbatim and often inconsistent records of the disease-associated variants, or merely a list of PubMed citations rather than the actual variants in question. Polymorphism-disease association data curated in this way are also likely to include markers that occur in linkage disequilibrium with the presumed disease-associated or functional variants rather than being of functional significance themselves. The HGMD team argues that a manually curated database provides a better solution. Indeed, HGMD is reported to be the only database that focuses specifically on the collation of functional or disease-associated polymorphic variants to the exclusion of linkage markers.

The National Center for Biotechnology Information (NCBI) has established the dbSNP database [S.T. Sherry, M. Ward and K. Sirotkin (1999) Genome Res., 9, 677--679] in response to a need for a general catalog of genome variation to address the large-scale sampling designs required by association studies, gene mapping and evolutionary biology. Submissions to dbSNP are integrated with other sources of information at NCBI such as GenBank, PubMed, LocusLink and the Human Genome Project data. dbSNP classifies nucleotide sequence variations into the following types, given here with their percentage composition of the database: (i) single nucleotide substitutions, 99.77\%; (ii) small insertion/deletion polymorphisms, 0.21\%; (iii) invariant regions of sequence, 0.02\%; (iv) microsatellite repeats, 0.001\%; (v) named variants, less than 0.001\%; and (vi) uncharacterized heterozygous assays, less than 0.001\%. There is no requirement or assumption about minimum allele frequencies or functional neutrality for the polymorphisms in the database. Thus, the scope of dbSNP includes disease-causing clinical mutations as well as neutral polymorphisms. In addition to the record identifiers assigned by both the submitter and NCBI, dbSNP entries record the sequence information around the polymorphism, the specific experimental conditions necessary to perform an experiment, descriptions of the population containing the variation, and frequency information by population or individual genotype.
The current level of activity in the discovery of general sequence variation suggests that SNP markers with unknown selective effects will be the majority of submitted records. Although most submissions are currently for Homo sapiens, dbSNP already has submissions for Mus musculus, and in general the database can accept variation information from any species and from any part of a particular genome. dbSNP also links variations (polymorphisms and clinical mutations) to other NCBI sequence resources via BLAST and E-PCR analysis of the flanking sequence that immediately surrounds the variation. Links to the literature databases are made with the citation information provided at submission time. This integration process makes dbSNP part of the NCBI discovery space. In this model, dbSNP serves dual roles as both a first point of entry into the resource network for query and retrieval of specific variation records, and as an information server for searches that start in other resources such as GenBank, PubMed, LocusLink or the genome sequence databases.

Mutation databases of human genes are assuming increasing importance in all areas of health care. In addition, more and more experts in the mutations and diseases of particular genes are curating published and unpublished mutations in locus-specific databases (LSDBs). These databases contain such extensive information that they have become known as knowledge bases. The content of these databases was analyzed(~ref) between June 21, 2001, and July 18, 2001; the authors were able to access 94 independent websites devoted to the documentation of mutations, containing 262 LSDBs for study. The study found 23,822 recorded mutations, along with 1,518 polymorphisms. Fifty-four percent of the LSDBs studied were easy to use and 11\% were hard to follow; 73\% of the databases were displayed through HTML. Three databases were given a high score for ease of use and wealth of content. The study thus made a strong case for uniformity of data to make the content maximally useful. To this end, a hypothetical content for an ideal LSDB was derived, together with a community structure that would improve the chances of a mutation being captured rather than left unpublished in a patient's report. The authors expressed the hope that the interested community and granting bodies would assist in achieving the vision of a public system that collects and displays all variants discovered.

The listing of mutations in the globin gene(s) was in fact the first locus-specific mutation database (LSDB), in which the main author was interested in collecting the details of each mutation and its phenotype. Today there are around 260 LSDBs hosted on nearly 100 websites. These databases vary in almost every aspect (except those on the same website, whose characteristics are similar), not only because they use 10 or so different software types but also because their initiators have had different interests and objectives in mind. In addition, some are better funded than others and so appear more professional. There are three main types of LSDB: those focusing on the mutation only and describing only the first example of each, e.g., the PAH database; those cataloguing patients with specific diseases and noting the mutations, e.g., MUTBASE; and those cataloguing somatic mutations, e.g., TP53.

HbVar (http://globin.bx.psu.edu/hbvar) is a locus-specific database (LSDB) developed in 2001 by a multi-center academic effort to provide timely information on the genomic sequence changes leading to hemoglobin variants and all types of thalassemia and hemoglobinopathies. Database records include extensive phenotypic descriptions, biochemical and hematological effects, associated pathology, and ethnic occurrence, accompanied by mutation frequencies and references. In addition to the regular updates to entries, the curators have reported significant advances and updates, which can be useful not only for HbVar users but also for the development and curation of other LSDBs. The query page provides more functionality in a simpler, more user-friendly format, and known single nucleotide polymorphisms in the human alpha- and beta-globin loci are provided automatically. Population-specific beta-thalassemia mutation frequencies for 31 population groups have been added or modified, and the previously reported delta- and alpha-thalassemia mutation frequency data from 10 population groups have also been incorporated. In addition, an independent flat-file database named XPRbase (http://www.goldenhelix.org/xprbase) has been developed and linked to the main HbVar web page to provide a succinct listing of 51 experimental protocols available for globin gene mutation screening. These updates significantly augment the database profile and the quality of the information provided, which should increase the already high impact of HbVar, while its combination with the powerful UCSC genome browser and the ITHANET web portal paves the way for drawing connections of clinical importance, that is, from genome to function to phenotype.

There are two major differences between LSDBs and central databases that have important consequences for their utility for specific purposes. First, LSDBs are run by experts in the gene involved; second, most of them collect unpublished mutations. As a consequence of the first point, many LSDBs are better described as knowledge bases of their genes, e.g., the PAH database, with enormous amounts of information ranging from material for biochemists to material for patients. As a consequence of the second point, a recent survey showed that LSDBs contained around 100\% more mutations than HGMD, which collects only published mutations.

Central or general mutation databases collect mutations in all genes, but the existing ones differ because they were initiated for different reasons. These have recently been reviewed. OMIM began as a systematic record of inherited syndromes in print form. As genes causing the syndromes were identified, the records in this compilation began to include mutations identified in such genes. Because it cannot keep up with all mutations, OMIM collects only the first mutation and then the most interesting ones after that. For example, as of 12 September 2001, OMIM contained 127 of 989 mutations for cystic fibrosis and 65 of 443 for phenylketonuria, compared with the mutations in the locus-specific databases for these genes. HGMD began as a research tool to document the different types of mutation occurring in humans; it ultimately led to the finding that mutations in CpG doublets were the most frequent, and then to exploration of why this was so. This collection from the published literature has become a useful compilation in which users can find whether a particular mutation has been described and, if so, by whom and where.

Drawing on the features observed in the international human mutation databases described above, and recognizing the importance of mutation effects in Thailand, we established a National and Ethnic Mutation Database (NEMDB) for Thai people. This database, named the Thailand Mutation and Variation database (ThaiMUT), offers web-based access to genetic mutation and variation information for the Thai population. This NEMDB initiative is an important informatics tool for both research and clinical purposes, allowing human variation data to be retrieved and deposited. The mutation data cataloged in ThaiMUT were derived from journal articles available in PubMed and from local publications.


\section{Related Works on Standard Human Mutation Nomenclature}

Consistent gene mutation nomenclature is essential for efficient and accurate reporting, testing, and curation of the growing number of disease mutations and useful polymorphisms being discovered in the human genome. While a codified mutation nomenclature system for simple DNA lesions has now been adopted broadly by the medical genetics community, it is inherently difficult to represent complex mutations in a unified manner. Suggestions have therefore been made for reporting such complex mutations.

A nomenclature system has recently been suggested for the description of changes (mutations and polymorphisms) in DNA and protein sequences. These nomenclature recommendations have now been largely accepted. However, current rules do not yet cover all types of mutations, nor do they cover more complex mutations. This document lists the existing recommendations and summarizes suggestions for the description of additional, more complex changes.

To translate basic research findings into clinical practice, it is essential that information about mutations and variations in the human genome is communicated easily and unequivocally. Unfortunately, there has been much confusion regarding the description of genetic sequence variants. This is largely because research articles that first report novel sequence variants often do not use standard nomenclature, and the final genomic sequence is compiled over many separate entries. In this article, we discuss issues crucial to clear communication, using examples of genes that are commonly assayed in clinical laboratories. Although molecular diagnostics is a dynamic field, this should not inhibit the need for and movement toward consensus nomenclature for accurate reporting among laboratories. Our aim is to alert laboratory scientists and other health care professionals to the important issues and provide a foundation for further discussions that will ultimately lead to solutions.

As part of the Human Genome Variation Society (HGVS; formerly known as the HUGO Mutation Database Initiative, http://www.HGVS.org), a committee was formed to suggest standards for the description of sequence variants in DNA, RNA, and protein sequences. The committee proposed that the nomenclature should be unequivocal, precise, and short, and should prevent any possible confusion and follow existing practice as much as possible. The nomenclature rules were published at regular intervals (Beaudet and Tsui 1993; Antonarakis and the Nomenclature Working Group, 1998; den Dunnen and Antonarakis, 2000). These nomenclature recommendations stimulated a uniform and unequivocal description of sequence variants in the literature and have now been largely accepted by the scientific community. The HUGO recommendation rules can be summarized as follows.

\subsection{Reference Sequence}
	Descriptions should always be given in relation to a reference sequence, i.e., the sequence establishing the numbering for all individual residues (nucleotides or amino acids) in the sequence studied. A publication should clearly state which reference sequence was used. This reference file should be present in a sequence database (DDBJ, EBI, or NCBI) and can be referred to by its accession and version number (e.g., GenBank AB012345.2). Preferably, this file should be present in the RefSeq database, or, when no proper file is present, a RefSeq file should be made, accurately curated, and submitted. The establishment of the reference sequence is difficult and fluid until the human genome is fully sequenced and the scientific community reaches a consensus on permanent DNA sequence coordinates.
	
\vspace{-1em}
\begin{enumerate}
	\item {\bf Genomic reference sequence}
	At the DNA level, a genomic reference sequence is preferred since it overcomes difficult cases, including multiple transcription initiation sites (promoters) and translation initiation sites (ATG codons), alternatively spliced exons, and the use of different poly(A) addition sites. Since sequence variants in the promoter region might affect gene function, the genomic reference sequence should start well 5$'$ of the promoter and transcription initiation site, cover the entire transcribed region, and end downstream of the poly(A) addition site. Nucleotide numbering starts with 1 at the first nucleotide, and numbering should proceed straight to the end (Figure~\ref{fig:refpos}). Numbering should not start at specific internal sites within the sequence; this would only unnecessarily complicate descriptions (e.g., one would have to know how to number bases 5$'$ of nucleotide 1).
	\item {\bf cDNA reference sequence}
	When the complete genomic sequence is not known, a cDNA reference sequence should be used. The cDNA reference sequence should represent the major transcript of the gene. Alternative exons (5$'$-first, internal, or 3$'$-terminal) should be numbered in relation to this reference, as for intronic sequences (Figure~\ref{fig:refpos}). In a cDNA reference sequence, nucleotide 1 is the A of the ATG translation initiation codon. Nucleotides going upstream (5$'$) of the ATG translation initiation codon are numbered $-$1, $-$2, etc. (Figure~\ref{fig:refpos}). There is no nucleotide 0. Nucleotides at the beginning of an intron are numbered in reference to the last nucleotide of the directly upstream exon, followed by a plus sign and the position in the intron, e.g., 87+1, 87+2, etc. (alternatively IVS1+1, IVS2+2, etc.; Figure~\ref{fig:refpos}). Nucleotides at the end of an intron are numbered in reference to the first nucleotide of the directly downstream exon, followed by a minus sign and the position in the intron, like 88-1, 88-2, etc. (alternatively IVS2-1, IVS2-2, etc.). The recommendation is to use the shortest description; thus, in the middle of the intron, nucleotide numbering changes from 87+ to 88-.
	\item {\bf mRNA reference sequence}
	The mRNA reference sequence should be identical to the cDNA reference sequence, and thus represent the major transcript (mRNA) of the gene. Alternative exons (5$'$-first, internal, or 3$'$-terminal) should be numbered as for the cDNA reference sequence. Nucleotide 1 is the A of the AUG translation initiation codon. Nucleotides going upstream (5$'$) of the AUG translation initiation codon are numbered $-$1, $-$2, etc. (Figure~\ref{fig:refpos}). There is no nucleotide 0.
	\item {\bf Protein reference sequence}
	The protein reference sequence should be based on the cDNA/mRNA reference sequence, i.e., the translation product encoded, annotated as ``CDS''
 (coding sequence) in this file. The first amino acid of the translation product, the translation-initiating methionine, is numbered as 1.
	
\end{enumerate}
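
The numbering rules for a cDNA reference sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 0-based transcript indices are hypothetical, and assigning the central nucleotide of an odd-length intron to the ``+'' side is a choice made here for the example.

```python
def coding_number(index, atg_index):
    """Convert a 0-based transcript index into a coding-DNA number.

    atg_index is the 0-based index of the A of the ATG start codon.
    The A is nucleotide 1, the base just upstream is -1; 0 is never returned.
    """
    offset = index - atg_index
    return offset + 1 if offset >= 0 else offset


def intron_position(offset, intron_length, last_exon_nt):
    """Describe the nucleotide at 1-based `offset` inside an intron.

    last_exon_nt is the coding-DNA number of the last nucleotide of the
    upstream exon (e.g. 87); the downstream exon starts at last_exon_nt + 1.
    The shorter of the two possible descriptions is returned.
    """
    if not 1 <= offset <= intron_length:
        raise ValueError("offset lies outside the intron")
    from_end = intron_length - offset + 1  # distance counted from the 3' end
    if offset <= from_end:
        return f"{last_exon_nt}+{offset}"
    return f"{last_exon_nt + 1}-{from_end}"
```

For a hypothetical intron of 100 nucleotides following exon position 87, the first intronic nucleotide is reported as 87+1 and the last one as 88-1.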

\begin{figure}
\begin{center}
\includegraphics[height=4in]{figure/nomen1.png}
\end{center}
\caption{nomen} \label{fig:refpos}
\end{figure}

\subsection{Discrimination Between DNA, RNA, and Protein}
To avoid confusion in the description of variants at different levels, the description should be preceded by a letter (followed by a period) indicating the type of reference sequence used:
\begin{itemize}
	\item ``g.'' for a genomic sequence, e.g., g.76A>T;
	\item ``c.'' for a cDNA sequence, e.g., c.76A>T;
	\item ``m.'' for a mitochondrial sequence, e.g., m.76A>T;
	\item ``r.'' for an RNA sequence, e.g., r.76a>u;
	\item ``p.'' for a protein sequence, e.g., p.Lys76Ala or p.K76A.
\end{itemize}
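
As a minimal sketch of how the prefix can be used in software, the following Python function parses a simple nucleotide substitution and checks that the letters match the level announced by the prefix (uppercase A, C, G, T for DNA and lowercase a, c, g, u for RNA, as the recommendations require). The function name, the regular expression, and the returned dictionary are assumptions made for this illustration; protein-level descriptions are deliberately not handled.

```python
import re

# Prefix -> (level name, allowed nucleotide letters).
LEVELS = {
    "g": ("genomic DNA", "ACGT"),
    "c": ("cDNA", "ACGT"),
    "m": ("mitochondrial DNA", "ACGT"),
    "r": ("RNA", "acgu"),
}

# Matches simple substitutions such as g.76A>T or r.76a>u.
SUBSTITUTION = re.compile(r"^([gcmr])\.(\d+)([A-Za-z])>([A-Za-z])$")


def parse_substitution(description):
    """Parse a simple nucleotide substitution and validate its alphabet."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"not a simple nucleotide substitution: {description!r}")
    prefix, position, ref, alt = match.groups()
    level, alphabet = LEVELS[prefix]
    if ref not in alphabet or alt not in alphabet:
        raise ValueError(f"{level} descriptions must use the letters {alphabet}")
    return {"level": level, "position": int(position), "ref": ref, "alt": alt}
```

With these definitions, a description such as r.76A>T is rejected because it mixes the RNA prefix with DNA letters.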

For the sake of a clear distinction, descriptions at DNA, RNA, and protein level differ uniquely. For DNA, the nucleotides are designated by the bases in uppercase letters: A, C, G, and T. For RNA, the nucleotides are designated by the bases in lowercase letters: a, c, g, and u. For protein, amino acids may be described using the one- or three-letter amino acid code (Table 7.13.1), with X used to indicate a translation termination (stop) codon. Although the one-letter code is currently recommended (Antonarakis and the Nomenclature Working Group, 1998), it often causes confusion for those amino acids whose names start with a letter other than their abbreviation; e.g., in the A group, alanine is abbreviated as A, arginine as R, asparagine as N, and aspartic acid as D; the G group contains glutamine abbreviated as Q, glutamic acid abbreviated as E, and glycine abbreviated as G.
At DNA level, the description has a format such as ``g.5T>G'' or ``c.5T>G'' (Fig. 7.13.2), i.e., starting with a number to indicate the position and followed by a letter to represent the nucleotide. At RNA level, the description has a format such as r.5u>g, i.e., starting with a number to indicate the position and followed by a letter to represent the nucleotide. At protein level, the description has a format such as p.Leu3Val (or p.L3V, if the one-letter amino acid codes are used), i.e., starting with a letter code referring to the amino acid and followed by the number indicating its position. Table (~ref) lists additional symbols and abbreviations used in describing variants, with examples of how they might be used.
\vspace{-1em}
\begin{enumerate}
	\item {\bf Range}
	A range of affected residues (nucleotides or amino acids) is indicated by an ``\_'' character (underscore), separating the first and last residue affected, e.g., g.7\_8del or p.Gly4\_Gln6del (p.G4\_Q6del using the one-letter amino acid codes).
	\item {\bf Repeated sequences}
	For deletions or duplications in stretches of repeated sequences (single residues, tandem repeats, etc.), the most 3$'$ (last) residue is arbitrarily assigned to have been changed. The variant ACTTTGTGCC to ACTTTG\_CC is thus described as g.7\_8delTG.
	\item {\bf Two variants in one allele}
	Two variants in the sequence of a gene on one chromosome (one allele) are described as ``[first variant; second variant]''
, e.g., [g.76A>C; g.83G>C].
	\item {\bf Two chromosomes}
	Variants in different alleles, e.g., in recessive diseases, are described as ``[variant allele 1]+[variant allele 2]''
, e.g., [g.76A>C]+[g.87delG]. Homozygous variants are described as [g.76A>C]+[g.76A>C], while [g.76A>C]+[?] describes an A to C variant at nucleotide 76 in one allele and an unknown variant in the other allele.
	\item {\bf Two transcripts from one allele}
	When a variant affects RNA processing, yielding two or more transcripts, these are described as ``[transcript 1, transcript 2]''
. In this way, [r.76a>c, r.73\_88del] describes the nucleotide variant c.76A>C causing the appearance of two RNA molecules, one carrying this variation and one, due to a shift of the splice donor site to within the exon, containing a deletion of nucleotides 73 to 88.
	
\end{enumerate}
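
The rule for repeated sequences and the bracket notation for alleles lend themselves to a short worked example. The Python sketch below is an illustration only, with hypothetical function names: it shifts a deletion to the most 3$'$ position that yields the same sequence and formats variants on one or two alleles as described in the list above.

```python
def shift_deletion_3prime(sequence, start, end):
    """Return the 1-based (start, end) of a deletion moved as far 3' as possible."""
    s, e = start - 1, end  # 0-based, half-open window of deleted bases
    # Moving the window one base 3' gives the same product exactly when the
    # first deleted base equals the base directly after the window.
    while e < len(sequence) and sequence[s] == sequence[e]:
        s += 1
        e += 1
    return s + 1, e


def describe_deletion(sequence, start, end, prefix="g"):
    """Describe a deletion after applying the most-3' rule."""
    start, end = shift_deletion_3prime(sequence, start, end)
    deleted = sequence[start - 1:end]
    if start == end:
        return f"{prefix}.{start}del{deleted}"
    return f"{prefix}.{start}_{end}del{deleted}"


def same_allele(*variants):
    """Several variants on one chromosome: [first; second]."""
    return "[" + "; ".join(variants) + "]"


def two_alleles(first, second):
    """Variants on different chromosomes: [allele 1]+[allele 2]."""
    return f"[{first}]+[{second}]"
```

Applied to ACTTTGTGCC, a TG deletion reported at positions 5 and 6 is described as g.7\_8delTG, in agreement with the example in the text.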

\subsection{Types of Variants: Descriptions}
At the molecular level, for DNA, RNA, and protein, seven elementary types of variants exist: substitution, deletion, duplication, insertion, inversion, translocation, and transposition (Fig. 7.13.2). Duplications can be considered as special types of insertions, but, since their origin at molecular level is different, they are treated separately. Since exon recognition (splicing) and translation are orientation-dependent, inversions occur only by chance in RNA and protein. At RNA and protein level, translocations yield fusion transcripts and fusion proteins, respectively. Transpositions can be considered as a deletion of a sequence at one position combined with an insertion of that same sequence at another position. In reporting a transposition, for the gene analyzed it can be described as a deletion; the Remarks column can then contain an additional remark describing the insertion at another position. Sometimes, variants are complex, i.e., consisting of two or more types, with a deletion/insertion (i.e., indel) being the most common. After gene expression, several other types of variants are recognized, but these should be treated separately since they describe the consequence of a molecular change and not its nature. Among these we recognize effects on DNA transcribed into RNA and processed to mRNA (e.g., affecting splicing, polyadenylation, RNA editing, or RNA stability) and mRNA translated into protein and processed into mature protein (e.g., phosphorylation, glycosylation, protein-protein interaction, or protein localization). To describe the consequences of variants on translation, terms such as nonsense, missense, and frameshift are frequently used. In addition to these effects, variants may occur that influence the level of transcription/translation, the site of transcription/translation (e.g., tissue or place in the cell), or its timing during differentiation. Current nomenclature rules do not cover such cases, but focus on the elementary types mentioned above. When such complex effects are seen, they should be mentioned separately, e.g., in the ``Remarks'' column of the tabular overview that lists all variants identified.

\vspace{-1em}
\begin{enumerate}
	\item {\bf Substitution}
	The simplest variant in a sequence is the substitution of one residue (nucleotide or amino acid) for another. Substitutions are described using a ``>''
 character. On the protein level, the ``>''
 character is not used. For example, g.5T>G indicates that the T at nt position 5 is changed to a G (Fig. 7.13.2), whereas p.Leu3Val (alternatively p.L3V or p.Leu3>Val) indicates that the leucine at amino acid position 3 is changed to a valine.
In the past, to discriminate them from likely pathogenic variants, nonpathogenic ``polymorphic''
 variants have been described as p.7T/G or p.Leu3Leu. The consensus is that this description should not be used.
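	\item {\bf Deletion}
	Deletions are described using the abbreviation ``del,'' either directly following the number of the residue deleted or after an indication of the range of the residues deleted (i.e., the number of the first and last residue deleted, separated by a ``\_'' character). For example, g.7\_8del (alternatively g.7\_8delTG) indicates that the dinucleotide TG at nts 7 and 8 was deleted, whereas p.Gly4\_Gln6del (alternatively p.G4\_Q6del) indicates that the amino acids from glycine-4 to glutamine-6 were deleted.

	\item {\bf Duplication}
	Duplications are described using the abbreviation ``dup,'' either directly following the number of the residue duplicated or after an indication of the range of the residues duplicated (i.e., the number of the first and last residue duplicated, separated by a ``\_'' character).
g.6dup (alternatively g.6dupG) indicates that the G at nt 6 was duplicated (Fig. 7.13.2). g.3\_6dup (alternatively g.3\_6dupTTTG or g.3\_6dup4) indicates that the 4 nts TTTG from 3 to 6 were duplicated.
p.Gly4\_Gln6dup (alternatively p.G4\_Q6dup) indicates that the amino acids from glycine-4 to glutamine-6 were duplicated.
Duplications can also be described as insertions; in the example above g.6\_7insTTTG describes the same variant as g.3\_6dup. However, since the molecular mechanism involves a duplication and the description is shorter, such variants should be described as a duplication and not as an insertion.

	\item {\bf Insertion}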
	Insertions are described using the abbreviation ``ins,'' directly following the number of the residues flanking the insertion site, separated by a ``\_'' character. For example, g.5\_6insAA indicates that the dinucleotide sequence AA was inserted between nucleotides 5 and 6 (Fig. 7.13.2), whereas p.Lys2\_Leu3insSer (alternatively p.K2\_L3insS) indicates that a serine was inserted between lysine-2 and leucine-3.
Note that duplicating insertions should be described as duplications (see discussion of Duplications, above). The description g.6insAA is not adequate, since this leads to confusion whether one means insertion ``at'' or ``after'' nucleotide 6. This confusion becomes even greater when intronic variants in relation to a cDNA reference sequence are described as c.88-6insT; what would ``after 88-6'' mean?
The description should always include the entire sequence inserted; a description like g.5\_6ins2 is not sufficient. When larger stretches of sequence are inserted, the description will become rather lengthy. In such cases descriptions like g.5\_6ins9 and g.5\_6ins1295 are acceptable, but they should be combined with a footnote pointing to a description of the entire sequence inserted, either in the publication itself or as the accession number of a sequence database file submitted.
	
	\item {\bf Inversion}
	Inversions are described using the abbreviation ``inv,''
 directly following the number of the residues which are inverted, separated by a ``\_''
 character. For example, g.5\_9inv (alternatively g.5\_9inv5) indicates that the sequence of the 5 nts from 5 to 9 was inverted (Fig. 7.13.2), whereas p.Leu3\_Gln6inv (alternatively p.L3\_Q6inv) describes that the segment from amino acids leucine-3 to glutamine-6 was inverted. Note that such variants at protein level are mainly theoretical.
	
	\item {\bf Translocation}
	Since translocations include different chromosomes/genes (read different reference sequences) and usually contain additional rearrangements (deletion, duplication, or insertion), their description is rather complex. Specific descriptions for translocations have been used by cytogeneticists for years. They have the format of, e.g., ``t(X;4)(p21.2;q34)''
. In line with this, a description at DNA level could have the format t(X;4)(p21.2;q34) (g.857\_858), indicating that the translocation disrupts the gene studied between nucleotide 857 and 858 of the genomic reference sequence. The sequences of the translocation breakpoints need to be submitted to a sequence database and the accession numbers should be listed.
At RNA and protein level, translocations yield fusion transcripts and fusion proteins.
	
	\item {\bf Complex variants}
	Based on the rules described above, any complex variant can be described as a series of independent changes. The most frequent complex type is an insertion/deletion (``indel''
). These are described as a deletion followed by an insertion. g.112\_117delinsTG (alternatively g.112\_117delAGGTCAinsTG) describes the replacement of nucleotides 112 to 117 (AGGTCA) by TG. p.Cys28\_Lys29Trp (alternatively p.C28\_K29W) describes the outcome of a 3-bp deletion affecting the codons for cysteine-28 and lysine-29, which leaves a codon for tryptophan.
	
	\item {\bf Special cases}
	Without experimental data, the consequence of a variant affecting the translation initiation codon is hard to predict. No protein may be produced or translation initiation may shift to an upstream or downstream site. For variants affecting Met1 the description ``p.? (Met1Val)''
 is recommended, indicating that it is unclear what the consequence of the variant is. When experimental data show that no protein is made, the description ``p.0'' is recommended.
Many variants at DNA level result in a shift of the translational reading frame, generating premature translation termination. Such variants are often called frameshift changes. At the protein level, these variants are not described as a deletion/insertion but by using the abbreviation ``fs''
 after the first amino acid affected by the change (e.g., p.Cys9fs). The format ``fsX(number)''
 can also be used with the ``number''
 indicating the position of the stop codon in the new, shifted reading frame. p.Arg97fsX23 (alternatively, p.R97fs) describes a frame-shifting variant with arginine-97 as the first affected amino acid and the new reading frame ending in a stop after 23 codons.
	
\end{enumerate}
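
To make the description rules for the elementary variant types concrete, the following Python sketch generates DNA-level descriptions for substitutions, insertions, and indels. It is an illustration under stated assumptions: the function names are hypothetical, and the duplication check only recognizes an insertion that copies the stretch directly preceding the insertion site.

```python
def describe_substitution(position, ref, alt, prefix="g"):
    """Substitution, e.g. g.5T>G."""
    return f"{prefix}.{position}{ref}>{alt}"


def describe_insertion(sequence, after, inserted, prefix="g"):
    """Describe `inserted` placed between 1-based positions after and after + 1.

    If the inserted bases repeat the stretch that ends at `after`, the
    variant is reported as a duplication instead of an insertion.
    """
    n = len(inserted)
    if after >= n and sequence[after - n:after] == inserted:
        first, last = after - n + 1, after
        if first == last:
            return f"{prefix}.{last}dup"
        return f"{prefix}.{first}_{last}dup"
    return f"{prefix}.{after}_{after + 1}ins{inserted}"


def describe_delins(first, last, inserted, prefix="g"):
    """Indel: positions first..last are replaced by `inserted`."""
    return f"{prefix}.{first}_{last}delins{inserted}"
```

For the sequence ACTTTGTGCC, inserting TTTG between nucleotides 6 and 7 is reported as g.3\_6dup rather than g.6\_7insTTTG, which matches the preference for the shorter duplication description stated above.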
The HGVS nomenclature recommendations are generally accepted, but they lack traceability: with these recommendations alone, geneticists have to provide gene information together with the nomenclature in order to trace a mutation back to its position on the chromosome. In this framework, we adopted the nomenclature that follows ``Standard Mutation Nomenclature in Molecular Diagnostics: Practical and Educational Challenges(~ref)''. According to that article, a standard description consists of 1) the GenBank accession number and version number of the coding DNA (or cDNA) reference sequence used, followed by 2) a colon ``:'' and 3) the mutation nomenclature recommended by HUGO as described above. For example, in the nomenclature of the CFTR mutation ``NM\_000492.3:c.350G>A'' (i.e., p.Arg117His), ``NM\_000492.3'' indicates the GenBank cDNA reference sequence used, ``c.'' indicates that the nucleotide number ``350'' is based on the coding DNA sequence, and ``G>A'' indicates that the nucleotide substitution is G to A. Other examples of standard and nonstandard colloquial nomenclature of genes and variants are listed in Table~\ref{table:exnomen}.
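
The three-part structure of such a name (accession, version, description) can be split mechanically. The Python sketch below is a minimal illustration; the class and function names are hypothetical and no check is made that the accession actually exists in GenBank.

```python
from typing import NamedTuple


class StandardName(NamedTuple):
    accession: str    # e.g. NM_000492
    version: int      # e.g. 3
    description: str  # e.g. c.350G>A


def parse_standard_name(name):
    """Split 'ACCESSION.VERSION:description' into its three parts."""
    reference, colon, description = name.partition(":")
    description = description.strip()
    if not colon or not description:
        raise ValueError("expected '<accession>.<version>:<description>'")
    accession, dot, version = reference.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError("the reference sequence must carry a version number")
    return StandardName(accession, int(version), description)
```

A name without a version number, such as a bare accession followed by the description, is rejected, since the version is what makes the position traceable to one specific sequence record.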


\begin{table}[]
\caption{Standard and Colloquial Nomenclature of Common Gene Variants}
\label{table:exnomen}
\centering
\begin{tabular}{lll}
\hline
Standard nomenclature & Colloquial nomenclature & Associated disease \\
\hline

F2 AF478696.1: g.21538G>A (c.*97G>A) 	& 	Prothrombin G20210A (or 20210G>A)	&	Venous thrombosis	\\
F5 NM\_000130.3: c.1601G>A (p.Arg534Gln) 	&	 Factor V 1691G>A (R506Q)			&	Venous thrombosis\\
MTHFR NM\_005957.3: c.665C>T (p.Ala222Val) &	MTHFR C677T (or 677C>T)			 &	Homocystinemia\\
HFE NM\_000410.3: c.845G>A (p.Cys282Tyr) &	HFE C282Y						&	Hemochromatosis	\\		
HFE NM\_000410.3: c.187C>G (p.His63Asp)	&	HFE H63D						&	Hemochromatosis\\

 
\hline
\end{tabular}
\end{table}


\section{Related Work on Software that Generates Standard Nomenclature}

In Lydon, on April 12, 2008, an XBG patient and his parents sued the department of clinical diagnosis in Lydon, the XBG mutation database, and the journal Human Mutation. The complaint was that serious and culpable mistakes were made during the clinical diagnosis of the pregnancy in the XBG family, which ultimately led to the birth of an affected child. A paper published in Human Mutation listed the sequence variant detected in the family as ``nonpathogenic.'' Careful examination would have revealed that the change was clearly pathogenic (a nonsense mutation). However, the accused parties failed to verify the data of the original report and just copied it. Because of the importance of the issue and the overall consensus on the rules, Human Mutation is adopting an editorial policy that requests absolute compliance with these mutation nomenclature rules before manuscripts will be accepted and published.

As the human genome sequencing project nears completion, there has been a vast increase in the rate at which disease and nondisease associated variant sequences are being sought and detected. This has heightened the need for software with which to accumulate allelic variant (mutation) data, and with which to make the data accessible to the scientific community. Many ad hoc solutions have been developed by those interested in specific genes and diseases, and the creation of central databases which hold data for all genes has provided an alternative repository for some of the locus data. Despite this, few specialised software tools exist for researchers to create their own locus-specific allelic variant databases. This article describes methods available to potential curators, including software systems developed with the sole purpose of generating locus-specific mutation databases. In particular, the authors' own software, MuStaR\texttrademark{}, is described. MuStaR\texttrademark{} allows curators to maintain a database on a laptop computer if desired, while being able to export the data to an automatically generated Website which will run on any CGI-compliant Web server. Searching the database and the submission of new mutations are made possible through fill-in Web forms. A number of other software tools which may be of use to curators are also described.

Mutalyzer(~ref) is the first tool for generating unambiguous, correct sequence variant descriptions and for the automated analysis and correction of sequence variant descriptions using reference sequences from any organism. Mutalyzer handles most variation types: substitution, deletion, duplication, insertion, indel, and splice-site changes following current recommendations of the Human Genome Variation Society (HGVS). Input is a GenBank accession number or an uploaded reference sequence file in GenBank format with user-modified annotation, an HGNC gene symbol, and the variant (single or in a batch file). Mutalyzer generates variant descriptions at DNA level, the level of all annotated transcripts and the deduced outcome at protein level. To validate Mutalyzer's performance and to investigate the sequence variant description quality in locus-specific mutation databases (LSDBs), more than 11,000 variants in the PAH, BIC BRCA2, and HbVar databases were analyzed, showing that 87\%, 25\%, and 38\%, respectively, were error-free and following the recommendations. Low recognition rates in BIC and HbVar (38\% and 51\%, respectively) were due to lack of a well-annotated genomic reference sequence (HbVar) or noncompliance to the guidelines (BRCA2). Provided with well-annotated genomic reference sequences, Mutalyzer is very effective for the curation of newly discovered sequence variation descriptions and existing LSDB data. Mutalyzer will be linked to the Leiden Open source Variation Database (LOVD) (www.LOVD.nl; last accessed 13 September 2007) and is the first module of a sequence variant effect prediction package.

\begin{figure}[ht]
\begin{center}
\includegraphics[height=4.5in]{figure/mutalyzer.png}
\end{center}
\caption{Mutalyzer user interface.} 
\label{fig:mutalyzergui}
\end{figure}

In the process of drawing up a computerized operation reporting system, a nomenclature for the precise description of recently fitted or existing aortocoronary bypasses has been developed. This is based on a sequence of letters showing in one line which type of bypass has been fitted, the graft material used, the central anastomosis (source) as well as the peripheral anastomoses on the coronary arteries (objective). For this purpose, abbreviations of the customary terms in use in cardiac surgery have been used. A computer graphics programme has been created in parallel, enabling all bypasses (existing and/or new) to be sketched into the diagram of a heart with the aid of a mouse. The bypass nomenclature is automatically generated from the diagram, which can also be printed out as a sketch of the operation. The complete diagram of the heart plus data input forms enable the operation report to be compiled automatically. The nomenclature and the graphics programme are easily learnt, simplify work, can readily be incorporated into a computerized hospital organization and enhance documentation quality.

From the existing mutation nomenclature tools, we conclude that the following features are not found in them.
\vspace{-1em}
\begin{enumerate}
	\item {\bf Gene Search}
	Annotation can be performed on the results of a gene search against an online or bundled database (provided in the online and offline versions of MUTANT). Geneticists can select the isoform and the reference version of the DNA sequence.
	\item {\bf Nomenclature visualization}
	Visualization of the mutation nomenclature helps geneticists specify the position of a mutation on the chromosome. This has two advantages: first, the correctness of the mutation position can be confirmed on both the forward and reverse strands of the DNA sequence; second, geneticists can view the mutation spectrum on the chromosome, which may reveal correlations between mutations and linkage disequilibrium.
	
	\item {\bf Nomenclature generated for various references and levels}
	Mutation nomenclature is generated at both the DNA and protein levels, and each version is also referenced to both gene and intron positions. This makes it convenient to check a technician's result against the chromatogram from the laboratory.
	
	\item {\bf Compatibility of mutation annotation files}
	The files saved by the application must be compatible with other genome browser applications, e.g., GBrowse(~ref), so that geneticists can view mutations alongside other variants such as SNPs. Displaying mutations in a generic browser also shows the user the spectrum of mutations at various positions on the chromosome.
	
	
	
\end{enumerate}
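
The compatibility requirement in the last item can be illustrated with GFF3, a tab-delimited annotation format that GBrowse can load. The Python sketch below is an assumption made for illustration only: the source name, the feature type \texttt{sequence\_alteration}, the sequence identifier, and the coordinates are hypothetical and do not describe the actual file format written by MUTANT.

```python
def escape_attribute(value):
    """Percent-encode the characters that GFF3 reserves inside attributes."""
    for char in "%;=&,\t":  # '%' comes first so it is not escaped twice
        value = value.replace(char, f"%{ord(char):02X}")
    return value


def gff3_line(seqid, start, end, strand, name,
              source="MUTANT", feature_type="sequence_alteration"):
    """Build one nine-column GFF3 feature line for a named variant."""
    attributes = f"Name={escape_attribute(name)}"
    columns = [seqid, source, feature_type, str(start), str(end),
               ".", strand, ".", attributes]
    return "\t".join(columns)
```

A variant written this way carries its full standard name in the \texttt{Name} attribute, so it can be displayed next to other tracks such as SNPs in a generic browser.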





